library(tidyverse)
colleges <- read_csv("https://www.dropbox.com/s/bt5hvctdevhbq6j/colleges.csv?dl=1")PA 3: Identify the Mystery College 🏫
Data Wrangling with dplyr
Today you will use the dplyr package to clean some data. We will then use that cleaned data to figure out what college Margaret has been accepted to.
This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to produce the best product possible, you will need to use the skills of each member of your group.
Groupwork Protocols
During the Practice Activity, you and your partner will alternate between two roles—Developer and Coder.
When you are the Developer, you will type into the Quarto document in RStudio. However, you do not type your own ideas. Instead, you type what the Coder tells you to type. You are permitted to ask the Coder clarifying questions, and, if both of you have a question, you are permitted to ask the professor. You are expected to run the code provided by the Coder and, if necessary, to work with the Coder to debug the code. Once the code runs, you are expected to collaborate with the Coder to write code comments that describe the actions taken by your code.
When you are the Coder, you are responsible for reading the instructions / prompts and directing the Developer what to type in the Quarto document. You are responsible for managing the resources your group has available to you (e.g., cheatsheet, textbook). If necessary, you should work with the Developer to debug the code you specified. Once the code runs, you are expected to collaborate with the Developer to write code comments that describe the actions taken by your code.
Group Norms
Remember, your group is expected to adhere to the following norms:
- Think and work together. Do not divide the work.
- You are smarter together.
- No cross-talk with other groups.
Part One: Data Import and Cleaning
This section will clean today’s data so that you can use it more easily in Part Two.
The partner whose birthday comes last starts as the Developer (typing and listening to instructions from the Coder)!
Data Download & Package Loading
First, we declare our package dependencies and load the data.
The data loading function read_csv() will give you an outpouring of helpful information about the dataset. If you do not see the word “error”, there is nothing to be concerned about.
Take a look at the variables in your downloaded data by running the following code. This tells reports the data type for each column in the dataset.
glimpse(colleges)Data Cleaning
Now we will clean the data. Alas, each of the R chunks in this section will cause an error and / or do the desired task incorrectly. Even the chunks that run without error are not correct! You will need to find the mistake and correct it to complete the intended action.
Step 1: There are too many variables in this data set. We don’t need all of them. Narrow your data set down to only:
INSTNMname of the institutionCITYcity,STABBRstate, andZIPZIP code of the institutionADM_RATEadmissions rateSAT_AVGaverage SAT scoreUGDSnumber of undergraduate studentsTUITIONFEE_INin- andTUITIONFEE_OUTout-of-state tuitionCONTROLWhether the school is public or privateREGIONregion of the school.
colleges_clean <- colleges | >
select(INSTNM, CITY, STABBR, ZIP,
ADM_RATE, SAT_AVG, UGDS,
TUITIONFEE_IN, TUITIONFEE_OUT
CONTROL, REGION) Step 2: Remove the schools that are for-profit (category 3), keeping public (category 1) and private schools (category 2).
colleges_clean <- colleges_clean |>
filter(CONTROL == 1, CONTROL == 2)Step 3: Adjust the appropriate variables to be numeric, using as.numeric().
colleges_clean <- colleges_clean |>
mutate(TUITIONFEE_IN = numeric(TUITIONFEE_IN),
TUITIONFEE_OUT = numeric(TUITIONFEE_OUT),
SAT_AVG = numeric(SAT_AVG),
UGDS = numeric(UGDS),
ADM_RATE = numeric(ADM_RATE)) Step 4: Adjust the appropriate variables to be factors, using as.factor().
colleges_clean <- colleges_clean |>
mutate(CONTROL = as.character(CONTROL),
REGION = as.character(REGION))Step 5: Create a new variable called TUITION_DIFF which contains the difference between out-of-state and in-state costs.
colleges_clean |>
TUITION_DIFF = TUITIONFEE_OUT - TUITIONFEE_INStep 6: Create a new variable called TOTAL_IN which contains the total amount of money made from tuition per year.
colleges_clean <- colleges_clean |>
select(TOTAL_IN = UGDS x TUITIONFEE_IN)Step 7: Remove every row with missing data.
colleges_clean <- colleges_clean |>
drop.na()Lastly, notice that each of these steps started with
colleges_clean <- colleges_clean |> ...That is pretty redundant! Instead, we could perform all these tasks as one long “pipeline.”
Step 8: Combine your (fixed) code chunks into a single code chunk that carries out all of the steps necessary to clean the data.
Think about coding efficiency – you should not have multiple calls to the same function!
# Code combining ALL of your previous steps into ONE pipelinePart Two: Identify the Mystery College
Wow! Your best friend Margaret has been accepted to her top choice college! Unfortunately, Margaret is a very mysterious person, and she won’t tell you directly which college this is. You’ll have to use her clues to figure out which school is her dream school.
Clues:
This college is located in Region 4.
This college’s admission rate is below the median rate for the region.
This college does charge the same for in- and out-of-state tuition.
The average SAT score of this college is an even number.
This college is in the state where gooey butter cake, the Chiefs, and the Anheuser-Busch Clydesdales are a thing.
More than 1,000 undergraduates attend this college.
The total amount of money made from tuition per year for this college is not less than $50,000,000.
Of the three options remaining at this step, Margaret will attend the college that has the lowest admission rate.
Submit the college Margaret will attend to the Canvas Quiz.